{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<!--BOOK_INFORMATION-->\n",
    "<a href=\"https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-opencv\" target=\"_blank\"><img align=\"left\" src=\"data/cover.jpg\" style=\"width: 76px; height: 100px; background: white; padding: 1px; border: 1px solid black; margin-right:10px;\"></a>\n",
    "*This notebook contains an excerpt from the book [Machine Learning for OpenCV](https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-opencv) by Michael Beyeler.\n",
    "The code is released under the [MIT license](https://opensource.org/licenses/MIT),\n",
    "and is available on [GitHub](https://github.com/mbeyeler/opencv-machine-learning).*\n",
    "\n",
    "*Note that this excerpt contains only the raw code - the book is rich with additional explanations and illustrations.\n",
    "If you find this content useful, please consider supporting the work by\n",
    "[buying the book](https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-opencv)!*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<!--NAVIGATION-->\n",
    "< [Compressing Color Spaces Using k-Means](08.02-Compressing-Color-Images-Using-k-Means.ipynb) | [Contents](../README.md) | [Implementing Agglomerative Hierarchical Clustering](08.04-Implementing-Agglomerative-Hierarchical-Clustering.ipynb) >"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Classifying handwritten digits using k-means\n",
    "\n",
    "Although the last application was a pretty creative use of $k$-means, we can do better still.\n",
    "We have previously discussed k-means in the context of unsupervised learning, where we\n",
    "tried to discover some hidden structure in the data.\n",
    "\n",
    "However, doesn't the same concept apply to most classification tasks? Let's say our task was\n",
    "to classify handwritten digits. Don't most zeros look similar, if not the same? And don't all\n",
    "zeros look categorically different from all possible ones? Isn't this exactly the kind of\n",
    "\"hidden structure\" we set out to discover with unsupervised learning? Doesn't this mean we\n",
    "could use clustering for classification as well?\n",
    "\n",
    "Let's find out together. In this section, we will attempt to use k-means to try and classify\n",
    "handwritten digits. In other words, we will try to identify similar digits without using the\n",
    "original label information."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading the dataset\n",
    "\n",
    "From the earlier chapters, you might recall that scikit-learn provides a whole range of\n",
    "handwritten digits via its `load_digits` utility function. The dataset consists of 1,797\n",
    "samples with 64 features each, where each of the features has the brightness of one pixel in\n",
    "an 8 x 8 image:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1797, 64)"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.datasets import load_digits\n",
    "digits = load_digits()\n",
    "digits.data.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Running k-means\n",
    "\n",
    "Setting up $k$-means works exactly the same as in the previous examples. We tell the\n",
    "algorithm to perform at most 10 iterations and stop the process if our prediction of the\n",
    "cluster centers does not improve within a distance of 1.0:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "import cv2\n",
    "criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)\n",
    "flags = cv2.KMEANS_RANDOM_CENTERS"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we apply $k$-means to the data as we did before. Since there are 10 different digits (0-9),\n",
    "we tell the algorithm to look for 10 distinct clusters:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "compactness, clusters, centers = cv2.kmeans(digits.data.astype(np.float32), 10, None, criteria, 10, flags)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And done!\n",
    "\n",
    "Similar to the $N \\times 3$ matrix that represented different RGB colors, this time, the centers array\n",
    "consists of $N \\times 8 \\times 8$ center images, where $N$ is the number of clusters. Therefore, if we want\n",
    "to plot the centers, we have to reshape the `centers` matrix back into 8 x 8 images:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAjwAAADoCAYAAADrPz6KAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAD5JJREFUeJzt3Xus1wX9x/H3IY4c8ICAWqGIh2rzkqtcF8NEbTYKazr/\nEB2N1brMyDWbdl2buv6qJv2Rk+we/7RFk7mWrotbDiU2aVS2VjkrUrJRxBDlJpfz+6fffyq82JfO\nvu89Hv/6hA/1Pd/Dq6/rvEcmJycLAKCzaVP9BwAAONkMHgCgPYMHAGjP4AEA2jN4AID2DB4AoL3p\nx/jn/j/rU29kgL+X13PqDer19FpOPe/NXrw3+3jR1/JYgydy5MiRqF+3bl3U33XXXVG/a9euqK+q\nuuWWW6L+k5/8ZNTPnDkz6ofJzp07o37FihVRf84550R9VdWXvvSlqF+wYEH8jGGQ/rytzZs3R/3N\nN98c9WNjY1FfVfWVr3wl6t/+9rdH/ejoaNR39pe//CXqr7/++vgZf//736P+e9/7XtRfffXVx91O\nnz7Qvwoj6d9TH/vYx6L+V7/6VdTv2bMn6quqzj777Khfs2ZN1Cev5cvxr7QAgPYMHgCgPYMHAGjP\n4AEA2jN4AID2DB4AoD2DBwBoz+ABANozeACA9gb64yWfeuqpqP/iF78Y9aeddlrUX3HFFVFfVfX1\nr3896pcvXx71b3rTm6J+KqU/nfdb3/pW1Kc/zfdEfprr+Ph4/Gs6OnDgQNT/4Q9/iPpt27ZF/e7d\nu6O+qmrTpk1R/+Y3vznqh+knLafvzSeeeCLq77zzzqj/zW9+E/VVVa985SujPv1J7kePHo36qfLo\no49G/Y9+9KOoP//886P+uuuui/qqqoULF0b94sWL42cMgk94AID2DB4AoD2DBwBoz+ABANozeACA\n9gweAKA9gwcAaM/gAQDaM3gAgPYMHgCgPYMHAGhvoLe0nn/++ag/66yzov6aa66J+l27dkV9VdXG\njRujftasWfEzhkX6ev7gBz+I+htuuCHqV65cGfVVVbNnz45/TUeHDh2K+vQO0bRpJ/9/O5166qlR\n/7/4M02V9G7h7bffHvUPP/xw1E9MTER9Vf79ecGCBVE/ffpA/3o7aZ555pmonzFjRtTfeuutUf+u\nd70r6quq5s2bF/Vz5syJnzEIfb8jAAD8l8EDALRn8AAA7Rk8AEB7Bg8A0J7BAwC0Z/AAAO0ZPABA\newYPANCewQMAtGfwAADtDfTYyDnnnBP1l112WdSvXbs26nfs2BH1VVWf//zno37RokXxM4ZFeuNl\n586dUf/qV7866rdu3Rr1VVWvec1roj59PUdGRqJ+qhw5ciTq//3vf0f9vn37on5sbCzqq6oWL14c\n9cPy2pyIJ598Muo3bdoU9emto/TrpSq/1Xf22WdH/bDcUkv/u07/c919991R/7e//S3qq6o++MEP\nRr1bWgAAJ4nBAwC0Z/AAAO0ZPABAewYPANCewQMAtGfwAADtGTwAQHsGDwDQnsEDALRn8AAA7Rk8\nAEB7Az0eOnPmzKhPD4g9/fTTUT937tyor6paunRp1M+YMSN+xrBIDwLu2rUr6jds2BD1P/vZz6K+\nKj84eeedd0b9G97whqifKunx0PQQ7IEDB6I+PTRclR+bnZycjJ8xLCYmJqL+pptuivqDBw9G/Q9/\n+MOor6p6xSteEfVTdXDyZFuyZEnUr169OurTY6B/+tOfor6qavPmzVGfHoI99dRTo/6l+IQHAGjP\n4AEA2jN4AID2DB4AoD2DBwBoz+ABANozeACA9gweAKA9gwcAaM/gAQDaM3gAgPYGektr9+7dUf/A\nAw9E/Uc/+tGoP3z4cNRXVd13331Rf/HFF0f9/Pnzo34qpfeORkdHo/68886L+hUrVkR9VdX9998f\n9XfffXfU33vvvcfdpreDBmnatOx/2+zfvz/qp0/PvpWcyG2csbGxqB8ZGYmfMSzOOOOMqF+5cmXU\nb9u2LerXr18f9VX518D4+Hj8jGGwcOHCqP/Upz4V9Zs2bYr69MZhVdUzzzwT9eltv0HxCQ8A0J7B\nAwC0Z/AAAO0ZPABAewYPANCewQMAtGfwAADtGTwAQHsGDwDQnsEDALRn8AAA7Q30ltbk5GTUp/d6\nZs2aFfXPPfdc1FdVbd++Per/9a9/RX3nW1rLli2L+ieeeCLqT+Q20q5du6J+7969UX/w4MHjbtOv\n30GaPXt21J911llRn95FmjlzZtRXVR04cCDq0/thw2TOnDlRn37t/ec//4n6E/lem34NpH9fDIv0\n63rjxo1R/+CDD0b9H//4x6ivqrr66quj/pRTTomfMQh9vyMAAPyXwQMAtGfwAADtGTwAQHsGDwDQ\nnsEDALRn8AAA7Rk8AEB7Bg8A0J7BAwC0Z/AAAO0N9JbWmWeeGfU333xz1N9xxx1Rf+jQoaivqrrl\nllui/rTTToufMSzSW0T33HNP1H/4wx+O+lWrVkV9VdXExETUf+1rX4v6sbGxqJ8qo6OjUX/llVdG\nfXrf50Tem+mds87Su3Lp65/e/Dv99NOjviq/75beahwWhw8fjvrvfve7Uf/www9H/W233Rb1VVXL\nly+P+qn6vukTHgCgPYMHAGjP4AEA2jN4AID2DB4AoD2DBwBoz+ABANozeACA9gweAKA9gwcAaM/g\nAQDaGznGfZKex0uGS3Y05+V5PafeoF5Pr+XU897sxXuzjxd9LY81eAAAhp5/pQUAtGfwAADtGTwA\nQHsGDwDQnsEDALRn8AAA7Rk8AEB7Bg8A0J7BAwC0Z/AAAO0ZPABAewYPANCewQMAtGfwAADtGTwA\nQHsGDwDQnsEDALRn8AAA7Rk8AEB7Bg8A0J7BAwC0Z/AAAO0ZPABAewYPANDe9GP888n/yZ+ClzMy\nwN/L6zn1BvV6ei2nnvdmL96bfbzoa3mswRN54YUXov7222+P+nvvvTfqZ8+eHfVVVZ/+9Kej/iMf\n+UjUz5o1K+qHyU9+8pOo/8AHPhD1r33ta6O+qmrt2rVR/5a3vCV+RkeHDh2K+i9/+ctRv2HDhqiv\nqrrvvvui/txzz436adOG5wPv3bt3R/2qVauiPn0vf/azn436qqrPfe5zUT937tz4GcPg6NGjUf/j\nH/846m+77bao3759e9RXVb3xjW+M+k984hNR/773ve+423nz5r3kPxuedzgAwAkyeACA9gweAKA9\ngwcAaM/gAQDaM3gAgPYMHgCgPYMHAGjP4AEA2hvoT1p+7LHHov6b3/xm1F933XVR/89//jPqq6rW\nr18f9TfccEPUD9NPWn722WejfvXq1VF/5MiRqD98+HDUV1XddNNNUf/QQw9F/cv9VM9htmXLlqhf\ns2ZN1Kc/Zbuq6pRTTon6PXv2RP34+Phxt9OnD/RbZyz9SciPPPJI1E9MTET9z3/+86ivyr+fX3LJ\nJfEzhkH6ffYb3/hG1L/qVa+K+osuuijqq6q2bt0a9fv27Yv6GTNmRP1L8QkPANCewQMAtGfwAADt\nGTwAQHsGDwDQnsEDALRn8AAA7Rk8AEB7Bg8A0J7BAwC0Z/AAAO0N9CBMel9mxYoVUX/ttddG/fe/\n//2or8rvO42MjMTPGBa//vWvo3779u1Rv27duqhfsmRJ1FdVvfe97436xx9/POqvuOKKqJ8qe/fu\njfo77rgj6i+44IKof8973hP1VVW//OUvo37BggVRf/nll0f9VEruflVVfehDH4r6xYsXR/3atWuj\nvqrqhRdeiH9NR5OTk1F/4403Rv3SpUuj/sEHH4z6qqqnn3466i+88MKonzlzZtS/FJ/wAADtGTwA\nQHsGDwDQnsEDALRn8AAA7Rk8AEB7Bg8A0J7BAwC0Z/AAAO0ZPABAewYPANDeQG9pTUxMRP1FF10U\n9ffcc0/Ub9q0Keqr8rsj+/fvj58xLNLbWAsXLoz6Sy+9NOrTr6+qqksuuSTqt2zZEvXDckvrpz/9\nadQ/9NBDUf/tb3876n//+99HfVXVL37xi6i/5pprov6qq66K+ql05ZVXRv25554b9Q888EDUn8hd\nrDlz5sS/pqNp07LPHdI7aul7/zvf+U7UV1UdOHAg6nfs2BH1g7pZ6RMeAKA9gwcAaM/gAQDaM3gA\ngPYMHgCgPYMHAGjP4AEA2jN4AID2DB4AoD2DBwBoz+ABANozeACA9gZ6PPTMM8+M+muvvTbq0+OR\nF1xwQdRXVd1///1R/+c//znq0yN+U2nfvn1Rf8YZZ0R9+vUyfXr+5To6Ohr1zz//fPyMYZAeEExt\n3Lgx6h999NH4Gekx25UrV8bPGBZz586N+vS9tnXr1qhftGhR1FdVzZ8/P/41HaWHVx955JGoTw/B\nPvnkk1FfVfXOd74z6ufNmxc/YxB8wgMAtGfwAADtGTwAQHsGDwDQnsEDALRn8AAA7Rk8AEB7Bg8A\n0J7BAwC0Z/AAAO0ZPABAewO9pXX06NGoHx8fj/qrrroq6l/3utdFfVV+p+Sxxx6L+mXLlkX9VLrw\nwgujftu2bVG/Z8+eqN+7d2/UV1X97ne/i/q3vvWt8TOmyuTk5HG3r3/966Pf+21ve1vUb9myJer/\n+te/Rn1V1bvf/e6T2g+Tw4cPR/1vf/vbqN+5c2fUL1myJOqrqnbv3h31s2fPjvr03thUmTFjRtRf\nfvnlUf/4449HfXp3rarqC1/4QtRfeuml8TMGwSc8AEB7Bg8A0J7BAwC0Z/AAAO0ZPABAewYPANCe\nwQMAtGfwAADtGTwAQHsGDwDQnsEDALQ30Fta+/fvj/o1a9ZE/aJFi6L+ueeei/qqqu3bt0d9et8l\nuTc2bdrU7tGLL7446tM/78c//vGoP3jwYNRX5Tebli5dGj9jGKxatSrqL7vssqjfsGFD1P/jH/+I\n+qqqW2+9NeoXLFgQP2NYpHeo7rrrrqhPb9A9++yzUV+V3y1M35uf+cxnjrs9/fTTo997kObMmRP1\n559/ftSnd9He//73R31V1Tve8Y6oHxsbi58xCD7hAQDaM3gAgPYMHgCgPYMHAGjP4AEA2jN4AID2\nDB4AoD2DBwBoz+ABANozeACA9gweAKC9gd7SmjlzZtTv27cv6levXh31yd2q/3f99ddH/fLly6N+\nmG5ppTde1q1bF/U33nhj1E+fnn+5fvWrX4368847L37GVBkZGTnuNr0VND4+HvXr16+P+vQuXlV+\nQ6iz9L2Qfm/es2dP1D/11FNRX1U1f/78qN+xY0fUJ3f0pvKW1uTkZNRv3rw56tO/B5ctWxb1VVWj\no6Pxr5kKPuEBANozeACA9gweAKA9gwcAaM/gAQDaM3gAgPYMHgCgPYMHAGjP4AEA2jN4AID2DB4A\noL2RY9zxyI58cDIc/8GkY/N6Tr1BvZ5ey6nnvdmL92YfL/paHmvwAAAMPf9KCwBoz+ABANozeACA\n9gweAKA9gwcAaO//AB9a8IupeEUTAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x1e277279ef0>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "plt.style.use('ggplot')\n",
    "%matplotlib inline\n",
    "fig, ax = plt.subplots(2, 5, figsize=(10, 4))\n",
    "centers = centers.reshape(10, 8, 8)\n",
    "for axi, center in zip(ax.flat, centers):\n",
    "    axi.set(xticks=[], yticks=[])\n",
    "    axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)\n",
    "plt.savefig('digits.png')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Look familiar?\n",
    "\n",
    "Remarkably, $k$-means was able to partition the digit images not just into any 10 random\n",
    "clusters, but into the digits 0-9! In order to find out which images were grouped into which\n",
    "clusters, we need to generate a labels vector as we know it from supervised learning\n",
    "problems:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "from scipy.stats import mode\n",
    "\n",
    "labels = np.zeros_like(clusters.ravel())\n",
    "for i in range(10):\n",
    "    mask = (clusters.ravel() == i)\n",
    "    labels[mask] = mode(digits.target[mask])[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we can calculate the performance of the algorithm using scikit-learn's\n",
    "accuracy_score metric:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.78464106844741233"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.metrics import accuracy_score\n",
    "accuracy_score(digits.target, labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remarkably, $k$-means achieved 78.4% accuracy without knowing the first thing about the\n",
    "labels of the original images!\n",
    "\n",
    "We can gain more insights about what went wrong and how by looking at the **confusion\n",
    "matrix**. The confusion matrix is a 2D matrix $C$, where every element $C_{i,j}$ is equal to the\n",
    "number of observations known to be in group (or cluster) $i$, but predicted to be in group $j$.\n",
    "Thus, all elements on the diagonal of the matrix represent data points that have been\n",
    "correctly classified (that is, known to be in group $i$ and predicted to be in group $i$). Off-diagonal\n",
    "elements show misclassifications.\n",
    "\n",
    "In scikit-learn, creating a confusion matrix is essentially a one-liner:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[177,   0,   0,   0,   1,   0,   0,   0,   0,   0],\n",
       "       [  0, 154,  25,   0,   0,   1,   2,   0,   0,   0],\n",
       "       [  1,   3, 147,  11,   0,   0,   0,   3,  12,   0],\n",
       "       [  0,   1,   2, 159,   0,   2,   0,   9,  10,   0],\n",
       "       [  0,  12,   0,   0, 162,   0,   0,   5,   2,   0],\n",
       "       [  0,   0,   0,  40,   2, 138,   2,   0,   0,   0],\n",
       "       [  1,   2,   0,   0,   0,   0, 177,   0,   1,   0],\n",
       "       [  0,  14,   0,   0,   0,   0,   0, 164,   1,   0],\n",
       "       [  0,  23,   3,   8,   0,   5,   1,   2, 132,   0],\n",
       "       [  0,  21,   0, 145,   0,   5,   0,   8,   1,   0]])"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.metrics import confusion_matrix\n",
    "confusion_matrix(digits.target, labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The confusion matrix tells us that $k$-means did a pretty good job at classifying data points\n",
    "from the first nine classes; however, it confused all nines to be (mostly) threes. Still, this\n",
    "result is pretty solid, given that the algorithm had no target labels to be trained on."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<!--NAVIGATION-->\n",
    "< [Compressing Color Spaces Using k-Means](08.02-Compressing-Color-Images-Using-k-Means.ipynb) | [Contents](../README.md) | [Implementing Agglomerative Hierarchical Clustering](08.04-Implementing-Agglomerative-Hierarchical-Clustering.ipynb) >"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
